Spatial database of planted forests in East Asia

Planted forests are critical to climate change mitigation and constitute a major supplier of timber/non-timber products and other ecosystem services. Globally, approximately 36% of planted forest area is located in East Asia. However, reliable records of the geographic distribution and tree species composition of these planted forests remain very limited. Here, based on extensive in situ and remote sensing data, as well as an ensemble modeling approach, we present the first spatial database of planted forests for East Asia, which consists of maps of the geographic distribution of planted forests and associated dominant tree genera. Of the predicted planted forest area in East Asia (948,863 km²), China contributed 87%, most of which is located in the lowland tropical/subtropical regions and the Sichuan Basin. With 95% accuracy and an F1 score of 0.77, our spatially continuous maps of planted forests enable accurate quantification of the role of planted forests in climate change mitigation. Our findings inform effective decision-making in forest conservation, management, and global restoration projects.

satellite images, where "planted forest" was one of the attributes of vegetation types (Fig. 2b) 12. This "planted forest" attribute includes restoration-oriented forests composed of broadleaf species, commercial forests dominated by productive species such as Japanese cedar (Cryptomeria japonica) and Hinoki cypress (Chamaecyparis obtusa), and disaster-prevention plantings, such as Japanese black pine (Pinus thunbergii) planted against coastal erosion and tropical species (e.g., Acacia confusa) planted as windbreaks. The national vegetation map has been gradually developed and improved since 2005. Finally, the data specific to ROK were a polygon map of planted and natural forests from the national forest cover map (Fig. 2b) 16. The ROK maps depict the distribution of planted and natural forests from 2009 to 2013, depending on the province. In addition to the country-specific data, we obtained the Spatial Database of Planted Trees covering China, Japan, and ROK (SDPT version 1.0; Fig. 2c) 13 and a global extent of planted trees 2015 14, which includes the land use classes of planted forest, woody plantations, and agroforestry of the global forest management map 266 (Fig. 2d). No data specific to DPRK were used in this study because none were available.
To prepare a training dataset for machine learning classification models, we prepared a 0.009° by 0.009° grid (approximately 1 km²) for the study region in East Asia. National planted forest maps of China 15, Japan 12, and ROK 16, as well as SDPT 13 and the Global Extent of Planted Trees 14, were extracted to the centroid of each grid cell using the "sf" or "raster" packages in R 267,268. China's in situ observations were associated with each grid cell by taking the majority vote of in situ points within each grid cell to determine whether that cell is a planted or natural forest. Grid cells with a 50/50 vote were removed from the training dataset. We then derived the response variable (a label of "planted" or "natural" forest) from these underlying datasets following the Quality-Oriented Data Integration (QODI) approach.
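The majority vote used to attach in situ points to grid cells can be illustrated with a minimal sketch. The study itself was carried out in R; the Python below is only an illustration, and the function names, the cell identifiers, and the example points are hypothetical.

```python
from collections import Counter

def majority_label(point_labels):
    """Return 'planted' or 'natural' by majority vote, or None for a tie."""
    counts = Counter(point_labels)
    planted, natural = counts["planted"], counts["natural"]
    if planted == natural:
        return None  # 50/50 vote: the cell is dropped from the training data
    return "planted" if planted > natural else "natural"

def label_cells(points):
    """points: iterable of (cell_id, label). Returns {cell_id: label}, ties removed."""
    by_cell = {}
    for cell_id, label in points:
        by_cell.setdefault(cell_id, []).append(label)
    voted = {cell: majority_label(labels) for cell, labels in by_cell.items()}
    return {cell: label for cell, label in voted.items() if label is not None}

# Hypothetical in situ points already assigned to grid cells
points = [
    ("c1", "planted"), ("c1", "planted"), ("c1", "natural"),
    ("c2", "planted"), ("c2", "natural"),   # tie, removed
    ("c3", "natural"),
]
cell_labels = label_cells(points)
```

Here cell c2 is excluded because its two points disagree, which mirrors the removal of grid cells with a 50/50 vote.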

Quality-oriented data integration (QODI).
Since the underlying datasets differed in data sources and estimation methods, we developed a quality-oriented data integration approach in which the response variable was defined in three different levels of integration (Fig. 3). For each level of integration, we trained a separate set of machine learning models, so that we can quantify the potential range in estimated planted forest areas.
The first level of integration took the most conservative approach in deriving the lower bound of our estimation. Since China's in situ observations, Japan's national vegetation map 12, and ROK's national planted forest map 16 were largely based on in situ observations, we labeled a unit forest area (i.e., grid cell) as planted if and only if the grid cell was identified as a planted forest by any of these in situ-based datasets or by at least three other datasets.
The second level of integration took a midway approach in which, in addition to planted forests identified in the first level of integration, a given grid cell was also labeled as a planted forest if at least two of the following datasets agreed: the national planted forest map of China 15, SDPT 13, and the Global Extent of Planted Trees 14.
The third level of integration took the most liberal approach in deriving the upper bound of our estimation, in which we assumed all underlying data sources were equally reliable and labeled a given grid cell as planted forest if it was identified as a planted forest by any of these datasets.
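The three levels of integration can be read as a simple decision rule, sketched below. This is an interpretation of the text, not the authors' code: the dataset keys are hypothetical names, "other datasets" is assumed to mean the three non-in situ sources listed for the second level, and the sketch does not model which cells are excluded from each training dataset.

```python
# Hypothetical keys for the underlying datasets described in the text
IN_SITU_BASED = ("china_in_situ", "japan_vegetation_map", "rok_forest_cover_map")
OTHER = ("china_planted_forest_map", "sdpt", "global_planted_trees_extent")

def qodi_planted(flags, level):
    """Return True if a grid cell is labeled planted at the given integration level.

    flags maps a dataset key to True when that dataset marks the cell as planted.
    Level 1 = lower bound, 2 = midpoint, 3 = upper bound.
    """
    in_situ_hit = any(flags.get(name, False) for name in IN_SITU_BASED)
    other_hits = sum(bool(flags.get(name, False)) for name in OTHER)
    required = {1: 3, 2: 2, 3: 1}  # agreement needed among the other datasets
    if level not in required:
        raise ValueError("level must be 1, 2, or 3")
    return in_situ_hit or other_hits >= required[level]

# A cell flagged by two of the three non-in situ datasets
cell = {"sdpt": True, "global_planted_trees_extent": True}
labels_by_level = [qodi_planted(cell, level) for level in (1, 2, 3)]
```

For this example cell, only the midpoint and upper-bound rules return a planted label, which is why the three levels bracket the estimated planted forest area.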

Fig. 1
Workflow for developing the spatial database of planted forests. The top section (yellow) represents the data fusion algorithm we used to integrate multi-source data into coherent training datasets. The bottom section (green) represents the ensemble model we developed to predict the spatial patterns of planted forests.
We also compiled 57 predictor variables for the supervised learning of the classification models (Fig. 1, Supplementary Table S1). The predictor variables consisted of five forest structure attributes 269,270, seven MODIS-derived vegetation characteristics, 21 bioclimatic attributes 271-274, 13 topographic attributes 275, four anthropogenic attributes 276-279, and seven soil attributes 280. We obtained four forest structure attributes from the most recent Global Ecosystem Dynamics Investigation (GEDI) dataset, namely canopy height (rh100), plant area index (pai), foliage height diversity (fhd_normal), and total canopy cover (cover) (see Supplementary Table S1) 18,269. We downloaded the raw footprint-level GEDI data (L2B), among which only full-power lasers were used in this study to ensure the accuracy of the measurement. GEDI data was processed using the "rGEDI" package in R 281. Another forest structure attribute, tree height 270, represents the 90th or 95th percentile of energy return height relative to the ground.
We extracted predictor variables to the centroid of each grid cell using the "sf" or "raster" packages in R 267,268. GEDI footprint-level data was associated with each grid cell by taking the mean value of each attribute. We kept only grid cells with a minimum of 5 m tree height in accordance with FAO's definition of "forest" 2,270. Our final training dataset encompassed more than 1.5 million grid cells for the upper bound dataset, 1.0 million grid cells for the midpoint dataset, and 0.9 million grid cells for the lower bound dataset, consisting of one response variable labeled as either "planted" or "natural" and 57 predictor variables. Finally, to account for the differences in terrestrial ecoregions, we divided the overall training dataset into three biomes (Fig. 2e). Based on the global terrestrial biome map 19, Temperate Grassland/Savanna and Montane and Flooded Grassland were grouped into "Temperate Grassland". Temperate Broadleaf and Mixed and Temperate Conifer were grouped into "Temperate Forest", and Tropical Moist, Tropical Dry, and Tropical Grassland/Savanna were grouped into "Tropical Forest and Savanna." The three biomes remained separated for the upper bound dataset, but Temperate Grassland and Temperate Forest were merged for the midpoint and lower bound datasets to form the "Temperate Forest and Grassland" biome due to low sample size in Temperate Grassland.
www.nature.com/scientificdata
For mapping purposes, we prepared another 0.009° by 0.009° grid (approximately 1 km²), covering forested area (≥5 m tree height) 2,270 in the study region with all predictor variables (new data; Fig. 2f). We chose the resolution 0.009° to align with most of the predictor variables (Supplementary Table S1). After a machine learning classification model was trained, estimation was made for each grid cell of this new data. For ROK and a majority of areas in Japan, however, we utilized the existing planted forest maps, namely the national forest cover map of ROK 16 and the national vegetation map of Japan (Fig. 2b) 12, respectively, to label the grid cells. Since reliable planted forest data already exist for these areas, we used our estimation only for the remaining areas in China, DPRK, and a small portion of Japan (Fig. 2f). Nevertheless, the existing data for ROK and a majority of areas in Japan were converted to the 0.009° resolution within the forested area for consistency. For the areas where our estimation is used, we imputed missing values in predictor variables of the new data using the "Hmisc" package in R 282 to provide a spatially continuous map. For the GEDI attributes (Supplementary Table S1), however, we imputed missing values by training random forest (RF) models (see below for details of RF) with seven MODIS attributes due to a large number of missing values (22%, 34%, and 44% of the sample size for the upper bound, midpoint, and lower bound dataset, respectively). For the midpoint and lower bound datasets, we used the average predicted values from 10 repetitions of random forest models using 200,000 data points to minimize computational time (Table 1). To assess the performance of the RF model in imputing missing values in GEDI attributes, we performed cross-validation using bootstrapping. For the upper bound dataset, we randomly sampled the dataset into the training (90%) and testing (10%) sets with replacement. For the midpoint and lower bound datasets, we randomly sampled 200,000 data points for the training sets with replacement, and the remainder was used as the testing dataset (Table 1). Based on 20 random iterations, we calculated the 95% confidence interval (CI) of the root mean square error (RMSE) and R-squared (R²). We calculated the 95% CI using the t0.975 value with 19 degrees of freedom.

Ensemble machine learning model.
We developed an ensemble model to estimate the spatial distribution of planted forests, with three candidate machine learning models: RF, support vector machines (SVM), and XGBoost. RF is a non-parametric ensemble learning approach 283, which combines a variant of decision trees and an additional level of randomness by bootstrapping sub-data and different sets of predictor variables to mitigate potential multicollinearity issues often encountered in multidimensional machine learning models 284. We used the "randomForest" package in R 285. SVM is a supervised learning model that constructs a hyperplane or set of hyperplanes in a high- or infinite-dimensional space to separate classes 286. We used the "e1071" package in R 287. XGBoost is a gradient-boosted decision tree algorithm designed to accommodate large data at high speed. We used the "xgboost" package in R 288. The three candidate models are frequently used in ecological and biological research with satisfactory performance 266,289. Other potential candidate models include artificial neural networks, k-nearest neighbor, and Naïve Bayes, which are not necessarily superior 290. All modeling processes were conducted in R 291.
Fig. 3
The response variable ("planted" or "natural" forest) was defined in a quality-oriented data integration approach based on multiple underlying data sources. Underlying datasets a-d correspond to Fig. 2a-d. Upper and lower bound models represent the most liberal and conservative approaches in labeling planted forest, respectively. The grey area was removed from the respective training dataset. All areas outside of the Venn diagrams were labeled natural forest. DPRK is not included in this figure due to the absence of training data associated with the country.
To assess the performance of the three candidate models in estimating planted forests, we conducted cross-validation using bootstrapping. Due to data size, we randomly sampled 50,000 points (25,000 for each class) for the upper bound and midpoint datasets and 80% of the sample points for the lower bound dataset for each of the ten repetitions to create the training set, and the rest composed the testing set (Table 1). Default hyperparameter values were used for the three candidate models. Based on 10 iterations, we calculated the 95% CI of classification accuracy and F1 score. We calculated the 95% CI using the t0.975 value with 9 degrees of freedom. Classification accuracy shows the proportion of overall correct prediction. While accuracy is the most widely used and intuitive evaluation metric for a classification problem, it can overestimate performance on imbalanced data. F1 score is the harmonic mean of precision and recall and is more appropriate for imbalanced data 292. Precision represents the correct prediction of the positive class (i.e., planted) among all positive predictions, and recall represents the correct prediction of the positive class among all actual positive cases 293. Since precision and recall tend to trade off against each other, the combined metric, F1 score, provides a better evaluation perspective of incorrectly predicted cases. Using both accuracy and F1 score, we present a suite of evaluation metrics of our candidate models for both correct and incorrect predictions of an imbalanced dataset. Other potential evaluation metrics include Cohen's Kappa. However, we did not use it in our study due to the controversy over its use 294. Compared with SVM and XGBoost, the RF model was 0.7-8.1% more accurate in terms of overall classification accuracy and 1.4-4.5% more reliable in terms of F1 score (Fig. 4). Thus, we chose RF as the final model.
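The evaluation metrics and the t-based confidence interval described above can be written out as a short sketch. This is an illustration in Python rather than the R workflow used in the study; the confusion-matrix counts and the ten accuracy values are made up for the example, and the critical value 2.262 is the standard two-sided 95% value of Student's t with 9 degrees of freedom.

```python
import math
import statistics

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 with 'planted' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

T_0975_DF9 = 2.262  # t quantile at 0.975 with 9 degrees of freedom (10 iterations)

def t_confidence_interval(values, t_crit=T_0975_DF9):
    """95% CI of the mean: mean +/- t * s / sqrt(n)."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean - half_width, mean + half_width

# Made-up confusion matrix counts and per-iteration accuracies
accuracy, precision, recall, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
accuracies = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.90, 0.89, 0.92, 0.91]
ci_low, ci_high = t_confidence_interval(accuracies)
```

For the 20-iteration imputation assessment, the same function would be called with the 19-degrees-of-freedom critical value (approximately 2.093) instead.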
To improve the performance of the model while minimizing computation time, we adjusted two hyperparameters of the RF algorithm: the number of decision trees and the number of predictor variables. Similar to the cross-validation described above, we randomly sampled 50,000 points (25,000 for each class) for the upper bound and midpoint models and 80% of the sample points for the lower bound model for each of the ten repetitions to assess RF performance using different hyperparameter values (Table 1). Specifically, we calculated the classification accuracy and F1 score for different hyperparameter values. Based on 10 iterations, we chose 100 decision trees for the upper bound and midpoint models and 200 for the lower bound model, at which both accuracy and F1 score converged (Fig. 5). We used the default number of predictor variables (seven) for all biomes for the upper bound model. We chose 26 and 42 for Temperate Forest and Grassland and Tropical Forest and Savanna, respectively, for the midpoint model (Fig. 6). We chose 20 and 40 for Temperate Forest and Grassland and Tropical Forest and Savanna, respectively, for the lower bound model (Fig. 6).
For the final RF model, we ensured that the training set had an equal number of points for each class (i.e., 50% planted forest and 50% natural forest) by randomly under-sampling the dominant class. The prediction of our classification model was the percent planted forest based on how many decision trees returned the "planted" prediction. We built 20 models to derive the mean percentage for each biome and model (upper bound, midpoint, and lower bound) (Table 1). Finally, we calculated the mean percentage of the three models as a final value, while upper and lower bounds serve as a potential range (Fig. 7). Grid cells with a predicted percentage ≥50% are considered planted forest (Fig. 8). Using the spatially continuous dataset of 57 predictor variables (see Data fusion), we created a map covering the entire forested area in East Asia using model prediction.
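The step from tree votes to the final planted/natural label can be sketched as follows. This is a simplified Python illustration, not the authors' R code: it assumes the per-model percentages have already been averaged over the 20 repeated models, and the function names and vote counts are hypothetical.

```python
def percent_planted(tree_votes):
    """Share of decision trees that return 'planted' for one grid cell."""
    return sum(1 for vote in tree_votes if vote == "planted") / len(tree_votes)

def final_label(pct_upper, pct_mid, pct_lower, threshold=0.5):
    """Average the upper bound, midpoint, and lower bound percentages and
    label the cell 'Planted' when the mean reaches the threshold."""
    mean_pct = (pct_upper + pct_mid + pct_lower) / 3
    label = "Planted" if mean_pct >= threshold else "Natural"
    return mean_pct, label

# Hypothetical forest of 100 trees, 60 of which vote 'planted'
votes = ["planted"] * 60 + ["natural"] * 40
pct = percent_planted(votes)

# Hypothetical percentages from the three models for one grid cell
mean_pct, label = final_label(pct_upper=0.75, pct_mid=0.5, pct_lower=0.25)
```

In this example the mean is exactly 0.5, so the cell is labeled planted under the ≥50% rule.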
Mapping dominant tree species of the planted forests.
Over the planted forest expanse in East Asia identified by the final RF classification model, we predicted the dominant tree species (to the genus level) of the planted forest for each criterion (Fig. 9). For the training set, we combined 2,481 in situ records in China 20-265 with the tree-level records of Japan 295 and ROK 296 National Forest Inventories (NFI). Specifically, we calculated importance value for each species for each NFI plot within the predicted planted forest expanse and identified the species with the highest importance value as the dominant species for the given plot. Importance value is the sum of the percent basal area and the percent number of individuals of each species and represents the overall dominance of the species 297,298. After identifying the dominant species for each NFI plot, we aggregated the plots into the 0.009° by 0.009° grid cells by taking the majority vote of the dominant species. We retained the genus names of the dominant species, and only genera with 60 or more samples were included to ensure a sufficient size of training data.
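The importance value calculation can be illustrated with a small worked example. The sketch below is in Python for illustration only; the function name, the plot data, and the basal area values are hypothetical, and the genus is taken as the first word of the species name.

```python
from collections import defaultdict

def dominant_species(trees):
    """trees: list of (species, basal_area) for one plot.

    Importance value = percent basal area + percent number of individuals.
    Returns the species with the highest importance value and all values.
    """
    basal = defaultdict(float)
    count = defaultdict(int)
    for species, basal_area in trees:
        basal[species] += basal_area
        count[species] += 1
    total_basal = sum(basal.values())
    total_trees = len(trees)
    importance = {
        s: 100 * basal[s] / total_basal + 100 * count[s] / total_trees
        for s in basal
    }
    return max(importance, key=importance.get), importance

# Hypothetical plot: one large pine and three small oaks
plot = [
    ("Pinus densiflora", 0.30),
    ("Quercus mongolica", 0.05),
    ("Quercus mongolica", 0.05),
    ("Quercus mongolica", 0.05),
]
species, importance = dominant_species(plot)
genus = species.split()[0]
```

Although the pine holds two thirds of the basal area, the oaks make up three quarters of the stems, so the oak has the higher importance value (about 108 versus 92) and the plot is assigned the genus Quercus.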
We trained an RF classification model using the same package in R, with the default hyperparameter setting and an identical set of predictor variables, except for roadless areas and GEDI attributes due to a substantial number of missing values (86% and 34% of the sample size, respectively). We ensured that the training set had an equal number of points for each class (i.e., genus) by combining random under-sampling and oversampling using the "UBL" package in R 299 . To assess the performance of the RF model in mapping dominant genera across the planted forest expanse in East Asia, we performed a 90/10 cross-validation using bootstrapping. In each iteration, we used stratified sampling to split the entire training dataset into the training (90%) and testing (10%) sets using the "caret" package in R 300 and conducted a combination of under-sampling and oversampling of the training set to address the class imbalance (Table 1). Based on 100 random iterations, we calculated the 95% CI of overall classification accuracy and precision, recall, and F1 score for each class.
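A stratified 90/10 split followed by class balancing of the training set can be sketched as below. This is a plain random-resampling illustration in Python, not a reproduction of the "caret" or "UBL" routines used in the study; the function names, the target class size, and the example labels are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.1, seed=0):
    """Split indices so that each class contributes the same test fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = max(1, round(len(idxs) * test_fraction))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return train, test

def balance_by_resampling(train_idx, labels, target, seed=0):
    """Bring every class to `target` samples in the training set only."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    balanced = []
    for idxs in by_class.values():
        if len(idxs) >= target:
            balanced.extend(rng.sample(idxs, target))     # under-sample
        else:
            balanced.extend(rng.choices(idxs, k=target))  # over-sample with replacement
    return balanced

# Hypothetical imbalanced genus labels
labels = ["Pinus"] * 100 + ["Larix"] * 20
train_idx, test_idx = stratified_split(labels)
balanced_idx = balance_by_resampling(train_idx, labels, target=50)
```

Resampling is applied after the split so that the testing set keeps the original class proportions.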

Data Records
The spatial database of planted forests consists of maps of estimated planted forest distribution (Figs. 7, 8) and dominant tree species (Fig. 9).
Prc_Pln: Percent planted forest. The values represented the average of the three models (upper bound, midpoint, and lower bound). NA for ROK and a majority of areas in Japan, where national planted forest maps 12,16 were used as the final planted/natural label (Fig. 2f).
Prc_P_U: Percent planted forest predicted by the upper bound model. NA for ROK and a majority of areas in Japan, where national planted forest maps 12,16 were used as the final planted/natural label (Fig. 2f). Note that values are not always higher than Prc_Pln.
Prc_P_L: Percent planted forest predicted by the lower bound model. NA for ROK and a majority of areas in Japan, where national planted forest maps 12,16 were used as the final planted/natural label (Fig. 2f). Note that values are not always lower than Prc_Pln.
Type: "Planted" or "Natural" forests based on the main result (i.e., the average of the three models). For our predicted percent planted forest, "Planted" if Prc_Pln ≥ 0.5 and "Natural" if Prc_Pln < 0.5. For Prc_Pln = NA, national planted forest maps 12,16 were used to determine if the given polygon is a planted forest, and if not, "Natural."
Typ_Upp: "Planted" or "Natural" forests based on the upper-bound model.
Typ_Lwr: "Planted" or "Natural" forests based on the lower-bound model.
Genus: For Type = "Planted", this attribute indicates the predicted dominant genus. NA for Type = "Natural".
Raster layers are also available for percent planted forest, type (planted or natural forest), and dominant genus, at https://doi.org/10.6084/m9.figshare.21774725.v3 301.
Based on our prediction, the total area of planted forests in East Asia was 948,863 km², ranging between 600,529 and 1,277,549 km². China accounted for 87% of the planted forest area in East Asia, most of which is in the lowland subtropical and tropical regions and the Sichuan Basin (Fig. 8). More than half of China's planted forest area was dominated by Cunninghamia (Table 2) in the subtropical region and the Sichuan Basin (Fig. 9). Larch (Larix spp.), black locust (Robinia spp.), and pine (Pinus spp.) were widely observed in northern and central China, and eucalyptus dominated planted forests in tropical regions.
In Japan and ROK, planted forests were uniformly distributed across the country (Fig. 8). More than half of Japan's total planted forest area was Chamaecyparis- or Cryptomeria-dominant (Table 2), while other coniferous genera (e.g., Abies and Pinus) covered northern planted forests (Fig. 9). ROK's planted forests were characterized by diverse genera; more than half of planted forest areas were dominated by pine, followed by deciduous trees including oak (Quercus spp.) and chestnut (Castanea spp.). DPRK's planted forests were mainly distributed in the south and were largely composed of oak, larch, and pine.
The input training data, including the response variable and predictor variables, used in this study are available at https://doi.org/10.6084/m9.figshare.21774812.v2 305. Underlying data included in situ and digitized planted-natural forest data: the in situ observational data of China, the Japan Vegetation Map 12 16, and the ROK National Forest Inventory 296. The sensitive information in these datasets will be available upon request via Science-i (https://science-i.org/) and approval from data contributors.
Table S1; see Quality-Oriented Data Integration (QODI) in Methods). R² was within the range of 31% to 42% for all the GEDI attributes in Temperate Grassland and Temperate Forest (Table 3). For Tropical Forest and Savanna, canopy height showed R² of 22%, and the rest of the attributes showed R² of almost 30%. Foliage height diversity showed the highest R² and total canopy cover showed the lowest root mean square error (RMSE) among all GEDI attributes in all groups (Table 3).

Model validation in estimating planted forests.
To evaluate the performance of our mapping product of East Asia, we compared our main prediction (Fig. 8) with the planted/natural labels of the midpoint dataset for China. We calculated classification accuracy, precision, recall, F1 score, and four elements of confusion matrices in percentage (true positive, false positive, false negative, and true negative, where positive class represented planted, and negative class represented natural forest). Our prediction is characterized by a high recall (0.99), indicating that 99% of the observed planted forests were correctly predicted as planted forest (Table 4). Our precision was 0.63, which indicates that approximately two out of three positive predictions are actually planted forests. This level of accuracy is similar to those of other large-scale forest mapping studies (0.60-0.80) 306-308.
While precision is often negatively associated with recall, the F1 score, 0.77, indicates that our model is well-balanced between precision and recall. The low precision is attributable to the imbalanced distribution of positive and negative classes in the validation set (the midpoint dataset for China). The number of samples for natural forests was almost 10 times greater than that of planted forests in our validation set (Table 4). While we maximized the predictive performance by balancing the training data, high accuracy and low precision are inevitable due to the imbalanced validation set.
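The reported F1 score follows directly from the reported precision and recall, and the effect of class imbalance can be checked with a back-of-the-envelope calculation. The sketch below assumes a ratio of exactly 10 natural cells per planted cell, whereas the text states only "almost 10 times", so the resulting accuracy is indicative rather than a reproduction of Table 4.

```python
precision, recall = 0.63, 0.99

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

# Assumption: exactly 10 natural cells for every planted cell
planted, natural = 1.0, 10.0
tp = recall * planted                  # planted cells correctly predicted
fp = tp * (1 - precision) / precision  # natural cells predicted as planted
tn = natural - fp                      # natural cells correctly predicted
accuracy = (tp + tn) / (planted + natural)

print(f"F1 = {f1:.2f}, accuracy under the 10:1 assumption = {accuracy:.3f}")
```

Under this assumption, a false positive rate of only about 6% among natural cells is enough to bring precision down to 0.63 while overall accuracy stays near 0.95, which is the pattern the paragraph above describes.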
To further validate the quality of our prediction, we also compared our estimated total area of planted forests against the reported values from the FAO Global Forest Resources Assessment (FRA) 2 and the National (Table 5). Our total predicted area of planted forests in East Asia was 948,863 km² with a range between 600,529 and 1,277,549 km², which is consistent with the FRA estimate (981,390 km²). The predicted area of China's planted forests was 825,751 km² (475,566-1,159,009 km²), while
Table 3. Evaluation in imputing missing data of GEDI attributes for mapping purposes. We conducted cross-validation with bootstrapping. The mean and 95% confidence interval (CI) from 20 iterations are shown for root mean square error (RMSE) and R-squared (R²).
Table 4. Evaluation metrics and elements of confusion matrices of the main prediction of planted forest distribution. Our final prediction was evaluated against the planted/natural labels of the midpoint dataset in China. The elements of confusion matrices are represented in percentages. The positive class represents planted, and the negative class represents natural forests. Accuracy shows the proportion of overall correct prediction, precision represents the correct prediction of the positive class (i.e., planted) among all positive predictions, recall represents the correct prediction of the positive class among all actual positive cases, and F1 score represents a balanced score of precision and recall.
Cryptomeria, Picea, Pinus, Quercus, and Tilia showed low recall compared to precision, indicating that true labels for these genera tended to be classified as other genera. Abies, Carpinus, Castanea, Castanopsis, Chamaecyparis, Cunninghamia, Eucalyptus, Fagus, Ilex, Larix, and Robinia had lower precision than recall due to the overprediction of these genera (Table 6).

Table 6. Evaluation of the random forest classification model in mapping the dominant tree species across the planted forest expanse in East Asia. We conducted a rigorous 90/10 bootstrapping cross-validation. The mean and 95% confidence interval (CI) are shown for the precision, recall, and F1 score of each class (i.e., genus) based on 100 random iterations. Columns: Genus, Precision (mean ± 95% CI), Recall (mean ± 95% CI), F1 score (mean ± 95% CI).

Uncertainties.
While this study advances the current understanding of planted forests in East Asia based on multi-source data consisting of in situ, digitized, and modeled datasets, uncertainties arose from two main sources. First, limited in situ data, especially from Japan, ROK, and DPRK, constitutes one of the largest sources of uncertainty. The limited in situ data from these countries could lead to lower accuracy in our planted forest prediction. To mitigate this uncertainty, we integrated different data sources for modeling (e.g., SDPT 13 and the Global Planted Trees Extent 2015 14), and the final map product for these countries relied on external sources 12,16.
Second, our map of planted tree species depicts the spatial distribution of the dominant tree species to the genus level across the range of planted forests. However, it is beyond the scope of this study to identify the spatial distribution of monoculture planted forests versus mixed-species planted forests, the latter of which are common in certain regions 310. This uncertainty in tree species richness can be mitigated by integrating the mapping products presented here with recent global high-resolution maps of local tree species richness and co-limitation 289. Furthermore, some genera predicted in our study had low F1 scores, which could be improved by increasing the sample size for these genera. Nevertheless, it is not realistic to achieve perfectly balanced data, and differences in predictive performance among genera are inevitable.

Usage Notes
Our final maps of planted forest range (Fig. 8) for Japan and ROK consist of data directly obtained from the national planted forest maps of Japan 12 and ROK 16 . Users of these particular maps should cite these sources accordingly.
Planted forests in this study include forests of all ages that have been planted for ecological restoration, commercial plantation, and other purposes, such as landscape and disaster prevention.
Since the underlying training datasets differ by planting years, we were only able to quantify a roughly estimated range of underlying years. Specifically, we overlaid our final map with two existing map layers with estimated forest age 302 and planted year 303,304 values. Based on these two sources, some planted forests were planted more than 100 years ago, while other planted forests are less than five years in age (Fig. 10). Estimation based on forest age 302 is consistent with the planting history in each country; the majority of planted forests were established post-war in Japan, followed by efforts in the Korean peninsula, while planted forests in China come from more recent planting (Fig. 10a). We included planted year information in our map product (see Data Records).